EDA on Black Friday Sale

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objects as go
import plotly.express as px
import rfit

Cleaning the data

In [2]:
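# Load the training data and print it in full (550,068 rows, 12 columns)
df = pd.read_csv("https://ccadroit.s3.amazonaws.com/cloudComputing/train.csv")
print(df)
# Drop the identifier columns, which the analysis below does not use
df = df.drop(["User_ID", "Product_ID"], axis=1)
# Only the last expression in a cell is displayed, so show the tail
df.tail()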
        User_ID Product_ID Gender    Age  Occupation City_Category  \
0       1000001  P00069042      F   0-17          10             A   
1       1000001  P00248942      F   0-17          10             A   
2       1000001  P00087842      F   0-17          10             A   
3       1000001  P00085442      F   0-17          10             A   
4       1000002  P00285442      M    55+          16             C   
...         ...        ...    ...    ...         ...           ...   
550063  1006033  P00372445      M  51-55          13             B   
550064  1006035  P00375436      F  26-35           1             C   
550065  1006036  P00375436      F  26-35          15             B   
550066  1006038  P00375436      F    55+           1             C   
550067  1006039  P00371644      F  46-50           0             B   

       Stay_In_Current_City_Years  Marital_Status  Product_Category_1  \
0                               2               0                   3   
1                               2               0                   1   
2                               2               0                  12   
3                               2               0                  12   
4                              4+               0                   8   
...                           ...             ...                 ...   
550063                          1               1                  20   
550064                          3               0                  20   
550065                         4+               1                  20   
550066                          2               0                  20   
550067                         4+               1                  20   

        Product_Category_2  Product_Category_3  Purchase  
0                      NaN                 NaN      8370  
1                      6.0                14.0     15200  
2                      NaN                 NaN      1422  
3                     14.0                 NaN      1057  
4                      NaN                 NaN      7969  
...                    ...                 ...       ...  
550063                 NaN                 NaN       368  
550064                 NaN                 NaN       371  
550065                 NaN                 NaN       137  
550066                 NaN                 NaN       365  
550067                 NaN                 NaN       490  

[550068 rows x 12 columns]
Out[2]:
        Gender    Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
550063       M  51-55          13              B                           1               1                  20                 NaN                 NaN       368
550064       F  26-35           1              C                           3               0                  20                 NaN                 NaN       371
550065       F  26-35          15              B                          4+               1                  20                 NaN                 NaN       137
550066       F    55+           1              C                           2               0                  20                 NaN                 NaN       365
550067       F  46-50           0              B                          4+               1                  20                 NaN                 NaN       490

Encoding Categorical Variables

In [3]:
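from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes in sorted label order: F -> 0, M -> 1 and A -> 0, B -> 1, C -> 2
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
#df['Age'] = LabelEncoder().fit_transform(df['Age'])
df['City_Category'] = LabelEncoder().fit_transform(df['City_Category'])
# Replace missing secondary product categories with 0 and store them as integers
df['Product_Category_2'] = df['Product_Category_2'].fillna(0).astype('int64')
df['Product_Category_3'] = df['Product_Category_3'].fillna(0).astype('int64')
# Only the last expression in a cell is displayed, so show the tail
df.tail()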
Out[3]:
        Gender    Age  Occupation  City_Category  Stay_In_Current_City_Years  Marital_Status  Product_Category_1  Product_Category_2  Product_Category_3  Purchase
550063       1  51-55          13              1                           1               1                  20                   0                   0       368
550064       0  26-35           1              2                           3               0                  20                   0                   0       371
550065       0  26-35          15              1                          4+               1                  20                   0                   0       137
550066       0    55+           1              2                           2               0                  20                   0                   0       365
550067       0  46-50           0              1                          4+               1                  20                   0                   0       490
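
The findings later in this notebook refer to encoded values such as Gender 1 or city category 2, so it helps to record which original label each code stands for. The following is a minimal sketch on a small made-up sample, not the real `df`; `encode_with_mapping` is a hypothetical helper introduced here for illustration and is not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_with_mapping(series):
    """Label-encode a column and return (codes, {original label: code})."""
    encoder = LabelEncoder()
    codes = encoder.fit_transform(series)
    # classes_ holds the sorted original labels; the label at position i gets code i.
    mapping = {label: code for code, label in enumerate(encoder.classes_)}
    return codes, mapping


# Made-up rows, for illustration only.
sample = pd.DataFrame({"Gender": ["F", "M", "F"], "City_Category": ["A", "C", "B"]})
gender_codes, gender_map = encode_with_mapping(sample["Gender"])
city_codes, city_map = encode_with_mapping(sample["City_Category"])
print(gender_map)
print(city_map)
```

Keeping the fitted encoder (or the mapping) around, instead of discarding it as `LabelEncoder().fit_transform(...)` does, makes it possible to label plots with the original categories.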

Check null values

In [4]:
df.isnull().sum()/df.shape[0]*100
Out[4]:
Gender                        0.0
Age                           0.0
Occupation                    0.0
City_Category                 0.0
Stay_In_Current_City_Years    0.0
Marital_Status                0.0
Product_Category_1            0.0
Product_Category_2            0.0
Product_Category_3            0.0
Purchase                      0.0
dtype: float64
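
Every column reports 0% here because cell In [3] already replaced the NaN values in Product_Category_2 and Product_Category_3 with 0. To see how much was missing originally, the same check would have to run before the fill. The sketch below illustrates this on a tiny frame built from the Product_Category_2 and Purchase values of the first four rows printed above; `missing_percent` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_percent(frame):
    """Return the percentage of missing values in each column."""
    return frame.isnull().mean() * 100


# Product_Category_2 and Purchase for rows 0-3 of the printed output.
raw = pd.DataFrame({
    "Product_Category_2": [np.nan, 6.0, np.nan, 14.0],
    "Purchase": [8370, 15200, 1422, 1057],
})

before = missing_percent(raw)                     # check before filling
filled = raw.fillna({"Product_Category_2": 0})    # same fill value as In [3]
after = missing_percent(filled)                   # check after filling

print(before)
print(after)
```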

EDA

Question 1

Who spends more on the Black Friday sale?

In [5]:
fig = px.box(df, x="Gender", y="Purchase")
fig.update_traces(quartilemethod="exclusive") # or "inclusive", or "linear" by default
fig.show()

Finding: Men (Gender = 1) spend more than women (Gender = 0) on the Black Friday sale.
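
A box plot is read by eye, so a numeric summary per group can back up the comparison. The sketch below uses made-up purchase amounts, not the real data, and relies on pandas' `describe()`, whose quartiles use linear interpolation and may therefore differ slightly from the "exclusive" quartile method used in the plot.

```python
import pandas as pd

# Made-up sample: Gender is already label-encoded (0 = F, 1 = M).
sample = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1],
    "Purchase": [100, 200, 300, 200, 400, 600],
})

# One row per gender with count, mean, std, min, quartiles and max.
summary = sample.groupby("Gender")["Purchase"].describe()
print(summary[["count", "mean", "25%", "50%", "75%"]])
```

On the real data, the same call on `df` would show whether the medians and means differ as much as the plot suggests.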

Question 2

Which age group is most likely to purchase?

In [6]:
# Pass the column by name; recent seaborn versions expect x= as a keyword argument
sns.countplot(x='Age', data=df)
plt.title('Distribution of Age')
plt.xlabel('Age Group')
plt.show()

Finding: Shoppers aged 26-35 make the most Black Friday purchases.
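
Each row of the dataset is one transaction, so the count plot measures transactions per age group rather than shoppers per age group. Separating the two requires User_ID, which was dropped in cell In [2], so the comparison would have to run on the raw frame. A minimal sketch on made-up rows (the IDs and ages here are hypothetical):

```python
import pandas as pd

# Made-up raw rows: one shopper in 26-35 with three purchases, and so on.
raw = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3, 3],
    "Age": ["26-35", "26-35", "26-35", "0-17", "55+", "55+"],
})

per_age = raw.groupby("Age").agg(
    transactions=("User_ID", "size"),    # number of rows (purchases)
    shoppers=("User_ID", "nunique"),     # number of distinct buyers
)
print(per_age)
```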

Question 3

Are married or unmarried shoppers more active during the Black Friday sale?

In [7]:
# Select Purchase before averaging so that non-numeric columns are never aggregated
mean_purchase = df.groupby("Marital_Status")["Purchase"].mean()
mean_purchase.plot(kind='bar')
plt.title("Marital_Status and Purchase Analysis")
plt.show()

Finding: Unmarried shoppers have the higher average purchase amount during the Black Friday sale.

Question 4

Which city category has the most purchases? And who buys more: newcomers or long-term residents?

In [8]:
# Pass the column by name; recent seaborn versions expect x= as a keyword argument
sns.countplot(x='City_Category', data=df)
plt.show()

Finding: City category 1 (B) has made the largest number of purchases.

In [9]:
# Select Purchase before averaging so that non-numeric columns are never aggregated
df.groupby("City_Category")["Purchase"].mean().plot(kind='bar')
plt.title("City Category and Purchase Analysis")
plt.show()

Finding: However, buyers in city category 2 (C) have the highest average purchase amount.
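
The two city plots answer different questions: the count plot shows how many purchases each category made, while the bar chart shows how much an average purchase was worth. Putting both in one table makes the distinction explicit. This is a sketch on made-up numbers, not the real data; the column names `n_purchases`, `mean_purchase` and `total_spend` are chosen here for illustration.

```python
import pandas as pd

# Made-up sample: City_Category is already label-encoded (0 = A, 1 = B, 2 = C).
sample = pd.DataFrame({
    "City_Category": [0, 1, 1, 1, 2, 2],
    "Purchase": [500, 300, 400, 200, 900, 700],
})

by_city = sample.groupby("City_Category")["Purchase"].agg(
    n_purchases="count",     # what the count plot shows
    mean_purchase="mean",    # what the bar chart shows
    total_spend="sum",
)

most_purchases = by_city["n_purchases"].idxmax()
highest_mean = by_city["mean_purchase"].idxmax()
print(by_city)
print(most_purchases, highest_mean)
```

In this toy sample the category with the most purchases is not the one with the highest average purchase, which is the kind of difference the two plots above can reveal.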

In [10]:
# Pass the column by name; recent seaborn versions expect x= as a keyword argument
sns.countplot(x='Stay_In_Current_City_Years', data=df)
plt.show()

Finding: It looks like the longer someone has lived in a city, the less likely they are to buy new things.
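
Stay_In_Current_City_Years is stored as text (note the "4+" value), so a plot or count may not list the categories from shortest to longest stay. Checking a trend across stay length is easier with an explicit order. A minimal sketch on made-up values, assuming the five categories are "0", "1", "2", "3" and "4+":

```python
import pandas as pd

# Assumed category order, from shortest to longest stay.
stay_order = ["0", "1", "2", "3", "4+"]

# Made-up sample values, for illustration only.
sample = pd.Series(["1", "4+", "0", "1", "2", "1", "3", "2"],
                   name="Stay_In_Current_City_Years")

# Count each category, then arrange the counts in the assumed order.
counts = sample.value_counts().reindex(stay_order, fill_value=0)
print(counts)
```

The same list could be passed to the count plot through seaborn's `order` parameter so the bars appear in that sequence.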